Hacemos analisis factorial para reducir las variables en otras variables resumen. Mientras la clusterizaciĂ³n agrupaba filas, la factorizaciĂ³n agrupa columnas. Pero, al igual que en clusterizaciĂ³n, queremos saber si las nuevas variables tienen un nombre, al cual se le denomina tĂ©cnicamente variable latente. En esta sesiĂ³n exploraremos la data a ver quĂ© emerge.
Para esta sesiĂ³n trabajaremos con la data de estos links:
library(htmltab)
# links
happyL=c("https://en.wikipedia.org/wiki/World_Happiness_Report",
'//*[@id="mw-content-text"]/div/table/tbody')
demoL=c("https://en.wikipedia.org/wiki/Democracy_Index",
'//*[@id="mw-content-text"]/div/table[2]/tbody')
# carga
happy = htmltab(doc = happyL[1],which = happyL[2],encoding = "UTF-8")
demo = htmltab(doc = demoL[1], which = demoL[2], encoding = "UTF-8")
# limpieza
happy[,]=lapply(happy[,], trimws,whitespace = "[\\h\\v]") # no blanks
demo[,]=lapply(demo[,], trimws,whitespace = "[\\h\\v]") # no blanks
library(stringr) # nombres simples
names(happy)=str_split(names(happy)," ",simplify = T)[,1]
names(demo)=str_split(names(demo)," ",simplify = T)[,1]
## Formateo
# Eliminemos columnas que no usaremos esta vez:
happy[,c('Overall','Rank')]=NULL
demo[,c('Changes','Rank')]=NULL
# También debemos tener nombres diferentes en los scores antes del merge:
names(happy)[names(happy)=="Score"]="ScoreHappy"
names(demo)[names(demo)=="Score"]="ScoreDemo"
# Tipo de variables:
## En demo:
demo[,-c(1,8,9)]=lapply(demo[,-c(1,8,9)],as.numeric)
# En happy:
happy[,-c(1)]=lapply(happy[,-c(1)],as.numeric)
# sin perdidos:
happy=na.omit(happy)
demo=na.omit(demo)
Presta atenciĂ³n al merge. Usualmente hacemos merge por default y luego perdemos varias filas:
nrow(merge(happy,demo))
## [1] 147
Hagamos un nuevo merge, donde nos quedemos con TODOS los paises que no estaban en uno u otro data frame:
HappyDemo=merge(happy,demo,all.x=T, all.y=T)
Esta vez HappyDemo tiene varios paises de mĂ¡s, pero con valores perdidos y nombres que no pudieron coincidir. Veamos:
# formateando a
# HappyDemo[!complete.cases(HappyDemo),]
library(knitr)
library(kableExtra)
kable(HappyDemo[!complete.cases(HappyDemo),],type='html')%>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
font_size = 10)
| Country | ScoreHappy | GDP | Social | Healthy | Freedom | Generosity | Perceptions | ScoreDemo | Electoral | FunctioÂning | PoliticalparticiÂpation | Politicalculture | Civilliberties | Regimetype | Region | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Angola | NA | NA | NA | NA | NA | NA | NA | 3.72 | 2.25 | 2.86 | 5.56 | 5.00 | 2.94 | Authoritarian | Sub-Saharan Africa |
| 26 | Cape Verde | NA | NA | NA | NA | NA | NA | NA | 7.78 | 9.17 | 7.36 | 6.67 | 6.88 | 8.82 | Flawed democracy | Sub-Saharan Africa |
| 33 | Congo (Brazzaville) | 4.812 | 0.673 | 0.799 | 0.508 | 0.372 | 0.105 | 0.093 | NA | NA | NA | NA | NA | NA | NA | NA |
| 34 | Congo (Kinshasa) | 4.418 | 0.094 | 1.125 | 0.357 | 0.269 | 0.212 | 0.053 | NA | NA | NA | NA | NA | NA | NA | NA |
| 37 | Cuba | NA | NA | NA | NA | NA | NA | NA | 2.84 | 0.00 | 3.57 | 3.33 | 4.38 | 2.94 | Authoritarian | Latin America |
| 40 | Democratic Republic of the Congo | NA | NA | NA | NA | NA | NA | NA | 1.13 | 0.00 | 0.00 | 1.67 | 3.13 | 0.88 | Authoritarian | Sub-Saharan Africa |
| 42 | Djibouti | NA | NA | NA | NA | NA | NA | NA | 2.77 | 0.42 | 1.29 | 3.89 | 5.63 | 2.65 | Authoritarian | Sub-Saharan Africa |
| 47 | Equatorial Guinea | NA | NA | NA | NA | NA | NA | NA | 1.92 | 0.00 | 0.43 | 3.33 | 4.38 | 1.47 | Authoritarian | Sub-Saharan Africa |
| 48 | Eritrea | NA | NA | NA | NA | NA | NA | NA | 2.37 | 0.00 | 2.14 | 1.67 | 6.88 | 1.18 | Authoritarian | Sub-Saharan Africa |
| 52 | Fiji | NA | NA | NA | NA | NA | NA | NA | 5.85 | 6.58 | 5.36 | 6.11 | 5.63 | 5.59 | Hybrid regime | Asia & Australasia |
| 63 | Guinea-Bissau | NA | NA | NA | NA | NA | NA | NA | 2.63 | 4.92 | 0.00 | 2.78 | 3.13 | 2.35 | Authoritarian | Sub-Saharan Africa |
| 64 | Guyana | NA | NA | NA | NA | NA | NA | NA | 6.15 | 6.92 | 5.36 | 6.11 | 5.00 | 7.35 | Flawed democracy | Latin America |
| 83 | Kosovo | 6.100 | 0.882 | 1.232 | 0.758 | 0.489 | 0.262 | 0.006 | NA | NA | NA | NA | NA | NA | NA | NA |
| 115 | North Korea | NA | NA | NA | NA | NA | NA | NA | 1.08 | 0.00 | 2.50 | 1.67 | 1.25 | 0.00 | Authoritarian | Asia & Australasia |
| 117 | Northern Cyprus | 5.718 | 1.263 | 1.252 | 1.042 | 0.417 | 0.191 | 0.162 | NA | NA | NA | NA | NA | NA | NA | NA |
| 119 | Oman | NA | NA | NA | NA | NA | NA | NA | 3.06 | 0.08 | 3.93 | 2.78 | 4.38 | 4.12 | Authoritarian | Middle East & North Africa |
| 121 | Palestine | NA | NA | NA | NA | NA | NA | NA | 3.89 | 3.33 | 0.14 | 7.78 | 4.38 | 3.82 | Authoritarian | Middle East & North Africa |
| 122 | Palestinian Territories | 4.696 | 0.657 | 1.247 | 0.672 | 0.225 | 0.103 | 0.066 | NA | NA | NA | NA | NA | NA | NA | NA |
| 124 | Papua New Guinea | NA | NA | NA | NA | NA | NA | NA | 6.03 | 6.92 | 6.07 | 3.89 | 5.63 | 7.65 | Flawed democracy | Asia & Australasia |
| 131 | Republic of the Congo | NA | NA | NA | NA | NA | NA | NA | 3.11 | 2.17 | 2.50 | 3.89 | 3.75 | 3.24 | Authoritarian | Sub-Saharan Africa |
| 142 | Somalia | 4.668 | 0.000 | 0.698 | 0.268 | 0.559 | 0.243 | 0.270 | NA | NA | NA | NA | NA | NA | NA | NA |
| 145 | South Sudan | 2.853 | 0.306 | 0.575 | 0.295 | 0.010 | 0.202 | 0.091 | NA | NA | NA | NA | NA | NA | NA | NA |
| 148 | Sudan | NA | NA | NA | NA | NA | NA | NA | 2.70 | 0.00 | 1.79 | 5.56 | 5.00 | 1.18 | Authoritarian | Middle East & North Africa |
| 149 | Suriname | NA | NA | NA | NA | NA | NA | NA | 6.98 | 9.17 | 6.43 | 6.67 | 5.00 | 7.65 | Flawed democracy | Latin America |
| 157 | Timor-Leste | NA | NA | NA | NA | NA | NA | NA | 7.19 | 9.58 | 6.29 | 5.56 | 6.88 | 7.65 | Flawed democracy | Asia & Australasia |
| 159 | Trinidad & Tobago | 6.192 | 1.231 | 1.477 | 0.713 | 0.489 | 0.185 | 0.016 | NA | NA | NA | NA | NA | NA | NA | NA |
| 160 | Trinidad and Tobago | NA | NA | NA | NA | NA | NA | NA | 7.16 | 9.58 | 7.14 | 6.11 | 5.63 | 7.35 | Flawed democracy | Latin America |
| 168 | United States | NA | NA | NA | NA | NA | NA | NA | 7.96 | 9.17 | 7.14 | 7.78 | 7.50 | 8.24 | Flawed democracy | North America |
| 169 | United States of America | 6.892 | 1.433 | 1.457 | 0.874 | 0.454 | 0.280 | 0.128 | NA | NA | NA | NA | NA | NA | NA | NA |
De lo anterior date cuenta que, por un lado, hay paises que les falta un bloque de indicadores, y que en muchos casos los nombres estĂ¡n mal escritos. Podemos recuperar algunos, pero en la data original:
# cambiemos a nombres usados por otra tabla:
## en demo por happy
demo[demo$Country=="Democratic Republic of the Congo",'Country']="Congo (Kinshasa)"
demo[demo$Country=="Republic of the Congo",'Country']="Congo (Brazzaville)"
demo[demo$Country=="Trinidad and Tobago",'Country']="Trinidad & Tobago"
demo[demo$Country=="North Macedonia",'Country']="Macedonia"
demo[demo$Country=="United States",'Country']="United States of America"
## en happy por demo
happy[happy$Country=="Palestinian Territories",'Country']="Palestine"
Luego de esos ajustes veamos:
HappyDemo=merge(happy,demo) # re creando HappyDemo
nrow(HappyDemo)
## [1] 151
En efecto se recuperaron paĂses; asĂ quedarĂ¡.
El anĂ¡lisis factorial exploratorio requiere que hagamos algunas observaciones previas.
dontselect=c("Country","ScoreHappy","ScoreDemo","Regimetype","Region")
select=setdiff(names(HappyDemo),dontselect)
theData=HappyDemo[,select] # sin los Scores ni nombre de paĂs.
# esta es:
library(polycor)
corMatrix=polycor::hetcor(theData)$correlations
library(ggcorrplot)
ggcorrplot(corMatrix)
ggcorrplot(corMatrix,
p.mat = cor_pmat(corMatrix),
insig = "blank")
Si puedes ver bloques correlacionados, hay esperanza de un buen analisis factorial.
library(psych)
psych::KMO(corMatrix)
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix)
## Overall MSA = 0.86
## MSA for each item =
## GDP Social Healthy
## 0.84 0.88 0.89
## Freedom Generosity Perceptions
## 0.81 0.54 0.73
## Electoral FunctioÂning PoliticalparticiÂpation
## 0.80 0.90 0.92
## Politicalculture Civilliberties
## 0.88 0.86
Aqui hay dos pruebas:
cortest.bartlett(corMatrix,n=nrow(theData))$p.value>0.05
## [1] FALSE
library(matrixcalc)
is.singular.matrix(corMatrix)
## [1] FALSE
fa.parallel(theData,fm = 'ML', fa = 'fa')
## Parallel analysis suggests that the number of factors = 3 and the number of components = NA
Se sugieren 3, veamos:
library(GPArotation)
resfa <- fa(theData,nfactors = 3,cor = 'mixed',rotate = "varimax",fm="minres")
##
## mixed.cor is deprecated, please use mixedCor.
print(resfa$loadings)
##
## Loadings:
## MR1 MR3 MR2
## GDP 0.295 0.885 0.112
## Social 0.330 0.729 0.102
## Healthy 0.371 0.791 0.136
## Freedom 0.190 0.350 0.514
## Generosity -0.121 0.569
## Perceptions 0.242 0.667
## Electoral 0.941 0.206
## FunctioÂning 0.747 0.437 0.323
## PoliticalparticiÂpation 0.749 0.329
## Politicalculture 0.590 0.273 0.499
## Civilliberties 0.904 0.317
##
## MR1 MR3 MR2
## SS loadings 3.543 2.652 1.441
## Proportion Var 0.322 0.241 0.131
## Cumulative Var 0.322 0.563 0.694
print(resfa$loadings,cutoff = 0.5)
##
## Loadings:
## MR1 MR3 MR2
## GDP 0.885
## Social 0.729
## Healthy 0.791
## Freedom 0.514
## Generosity 0.569
## Perceptions 0.667
## Electoral 0.941
## FunctioÂning 0.747
## PoliticalparticiÂpation 0.749
## Politicalculture 0.590
## Civilliberties 0.904
##
## MR1 MR3 MR2
## SS loadings 3.543 2.652 1.441
## Proportion Var 0.322 0.241 0.131
## Cumulative Var 0.322 0.563 0.694
Cuando logramos que cada variable se vaya a un factor, tenemos una estructura simple.
fa.diagram(resfa)
resfa$crms
## [1] 0.04366199
resfa$RMSEA
## RMSEA lower upper confidence
## 0.11671263 0.08321419 0.14351745 0.90000000
resfa$TLI
## [1] 0.91647
sort(resfa$communality)
## Generosity Freedom Perceptions
## 0.3389670 0.4224908 0.5088023
## Social Politicalculture PoliticalparticiÂpation
## 0.6500405 0.6712479 0.6741318
## Healthy FunctioÂning GDP
## 0.7814498 0.8533605 0.8827731
## Civilliberties Electoral
## 0.9234226 0.9301124
sort(resfa$complexity)
## Generosity Electoral GDP
## 1.092530 1.098968 1.253519
## Civilliberties Perceptions PoliticalparticiÂpation
## 1.259556 1.283187 1.395229
## Social Healthy FunctioÂning
## 1.438097 1.489756 2.031831
## Freedom Politicalculture
## 2.076039 2.391297
¿QuĂ© nombres les darĂas?
as.data.frame(resfa$scores)%>%head()
## MR1 MR3 MR2
## 1 -0.3784192 -1.512818957 -1.4977970
## 2 0.3743268 0.009079471 -0.5806275
## 3 -0.7709189 0.435036712 -0.7990531
## 4 0.7666525 0.448104287 -0.8329555
## 5 0.2210220 -0.026986387 -1.0749388
## 6 1.1984140 0.736304849 1.5216622
HappyDemoFA=cbind(HappyDemo[1],as.data.frame(resfa$scores))
library(plotly)
plot_ly(data=HappyDemoFA, x = ~MR1, y = ~MR2, z = ~MR3, text=~Country) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Demo'),
yaxis = list(title = 'Tranquilidad'),
zaxis = list(title = 'Bienestar')))
RECORDANDO:
library(fpc)
library(cluster)
library(dbscan)
# YA NO NECESITAS CMD para HappyDemoFA[,c(2:4)]
g.dist.cmd = daisy(HappyDemoFA[,c(2:4)], metric = 'euclidean')
kNNdistplot(g.dist.cmd, k=3)
abline(h=0.63,col='red')
Para tener una idea de cada quien:
resDB=fpc::dbscan(g.dist.cmd, eps=0.63, MinPts=3,method = 'dist')
HappyDemoFA$clustDB=as.factor(resDB$cluster)
aggregate(cbind(MR1, MR2,MR3) # dependientes
~ clustDB, # nivel
data = HappyDemoFA, # data
max) # operacion
## clustDB MR1 MR2 MR3
## 1 0 -0.2808243 1.9591079 2.083118
## 2 1 1.2654673 2.1312288 1.124425
## 3 2 -1.0822343 0.5173523 1.569088
## 4 3 -1.2675411 1.4032479 0.344846
## 5 4 1.0866344 -0.3962033 -1.522221
plot_ly(data=HappyDemoFA, x = ~MR1, y = ~MR2, z = ~MR3, text=~Country, color = ~clustDB) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Demo'),
yaxis = list(title = 'Tranquilidad'),
zaxis = list(title = 'Bienestar')))
Finalmente, veamos relaciones:
library(BBmisc)
HappyDemo$faDemo=normalize(HappyDemoFA$MR1,
method = "range",
margin=2, # by column
range = c(0, 10))
HappyDemo$faHappyInd=normalize(HappyDemoFA$MR2,
method = "range",
margin=2, # by column
range = c(0, 10))
HappyDemo$faHappySoc=normalize(HappyDemoFA$MR3,
method = "range",
margin=2, # by column
range = c(0, 10))
You can see them all here:
plot(HappyDemo[,c("ScoreDemo","ScoreHappy","faDemo","faHappyInd",
"faHappySoc")])
Aqui acaba la Unidad II, el analisis factorial confirmatorio se verĂ¡ en la siguiente Unidad.